Instructions

This homework covers the linear models and diagnostics & transforms lectures. All work must be submitted electronically. For full credit you must show all of your steps. Use of computational tools (e.g., R) is encouraged; and when you do, code inputs and outputs must be shown in-line (not as an appendix) and be accompanied by plain English that briefly explains what the code is doing.

Problem 1: Estimation (20 pts)

Let \(Y_1, \dots, Y_n\) be independent Poisson random variables with mean \(\theta\).

  1. (5 pts) Derive the method of moments estimator for \(\theta\). ‘1a’
  2. (5 pts) Derive the maximum likelihood estimator \(\hat{\theta}_n\) for \(\theta\). How does this compare to what you found in part a.? ‘1b’
  3. (5 pts) Provide the asymptotic sampling distribution for \(\hat{\theta}_n\). ‘1c’
  4. (5 pts) With the following data values for \(y\) given in R below, what are the chances that the data could have been generated from a Poisson with parameter \(\theta = 5\)? Or \(\theta = 6\)? Or \(\theta = 7\)?
y <- c(3, 5, 6, 5, 2, 6, 6, 7, 8, 8, 7, 8, 0, 5, 7, 6, 6, 10, 6, 5, 6, 7, 6,  
    9, 8, 4, 8, 7, 11, 9, 4, 4, 7, 9, 8, 6, 5, 6, 12, 10, 7, 13, 8, 12, 9, 4, 
    10, 8, 4, 5)

pois <-function(y, theta){
  summ <- 0
  for(x in y){
    one <- (theta**x*exp(-theta))/factorial(x)
    summ <-summ + log(one)
  }
  -summ
}

pois(y, 5)
## [1] 133.5273
pois(y, 6)
## [1] 121.1733
pois(y, 7)
## [1] 118.4538

Problem 2: Tractors revisited (20 pts)

Revisit the tractor data in tractor.csv from Homework 1. If necessary, first re-calcuate your estimated coefficients from the least squares fit to the data.

  1. (10 pts) Estimate \(\sigma^2\), and use it to give a 95% prediction interval for a three year old tractor, taking uncertainty in your estimates of \(\beta_0\) and \(\beta_1\) into account.
tractor <- read.csv("tractor.csv")
attach(tractor)
summ = summary(reg <- lm(cost ~ age))
s <- summary(reg)$sigma**2
n <- length(cost)
#variance:
s
## [1] 96277.21
c <-confint(reg, level=0.95)
#conf int for 3 yrs
c(3*c[2]+c[1], 3*c[4]+c[3])
## [1]  182.1401 1330.0605
  1. (10 pts) On a plot including the data, provide a summary of the predictive distribution (i.e., predictive mean and 95% interval with both estimated from the data) for tractors ranging from brand new to ten years old.
xf <- data.frame(age=c(seq(1, 10)))
pred <- predict(reg, newdata=xf, interval="confidence")
plot(age, cost, xlim = c(1,10), ylim = c(0, 2200))
lines(pred[,1])
lines(pred[,2], col="red")
lines(pred[,3], col="red")

Problem 3: Leverage (20 pts)

In this problem we will add mathematical heft to the concept of the leverage of a data point.

  1. (15 pts) Show that the least squares fitted values \(\hat{y}_i\), for \(i=1,\dots, n\), can be written as a linear combination of all observed \(y_j\) values, for \(j=1,\dots,n\). That is, show that we can write \[ \hat{y}_i = \sum_{j=1}^n h_{ij} y_j \quad \mbox{ for } \quad i=1, \dots, n. \] Hint: \(h_{ii} = h_i\), the leverage of the \(i^{\mathrm{th}}\) data point from lecture. ‘2a’

  2. (5 pts) Now, show that the derivative of \(\hat{y}_i\) with respect to \(y_i\) is the leverage \(h_i\). I.e., \(\frac{d \hat{y}_i}{d y_i} = h_i\). ‘2b’ ### Problem 4: Transforms (20 pts)

The file transforms.csv contains 4 pairs of \(x\)s and \(y\)s. For each pair:

  1. Fit the linear regression model \(Y = \beta_0 + \beta_1 x + \varepsilon\), where \(\varepsilon \sim \mathcal{N}(0,\sigma^2)\). Plot the data and add the fitted line.
transforms <- read.csv("~/Documents/Documents/Machine Learning/transforms.csv", header = TRUE)
plot(transforms$X1, transforms$Y1, xlab = "X1", ylab = "Y1", col= 1); abline(lm(transforms$Y1 ~ transforms$X1), col = 1)

plot(transforms$X2, transforms$Y2, xlab = "X2", ylab = "Y2", col=2); abline(lm(transforms$Y2 ~ transforms$X2), col = 2)

plot(transforms$X3, transforms$Y3, xlab = "X3", ylab = "Y3", col=3); abline(lm(transforms$Y3 ~ transforms$X3), col = 3)

plot(transforms$X4, transforms$Y4, xlab = "X4", ylab = "Y4", col=4); abline(lm(transforms$Y4 ~ transforms$X4), col = 4)

  1. Provide a scatterplot, normal Q-Q plot, and histogram for the studentized regression residuals.
reg1 <- lm(transforms$Y1 ~ transforms$X1)
hist(rstudent(reg1), main="", col = 1); qqnorm(rstudent(reg1), col=1, main=""); plot(reg1$fitted, rstudent(reg1), ylab="student residuals", xlab = "fitted", col=1)

reg2 <- lm(transforms$Y2 ~ transforms$X2)
hist(rstudent(reg2), main="", col = 2); qqnorm(rstudent(reg2), col=2, main=""); plot(reg2$fitted, rstudent(reg2), ylab="student residuals", xlab = "fitted", col=2)

reg3 <- lm(transforms$Y3 ~ transforms$X3)
hist(rstudent(reg3), main="", col = 3); qqnorm(rstudent(reg3), col=3, main=""); plot(reg3$fitted, rstudent(reg3), ylab="student residuals", xlab = "fitted", col=3)

reg4 <- lm(transforms$Y4 ~ transforms$X4)
hist(rstudent(reg4), main="", col = 4); qqnorm(rstudent(reg4), col=4, main=""); plot(reg4$fitted, rstudent(reg4), ylab="student residuals", xlab = "fitted", col=4)

  1. Using the residual scatterplots, state how the SLR model assumptions are violated.
  1. The additive errors are not identically distributed (the scatter plot is bunched up at the beginning and then the points fan out which means the variance is increasing in x)
  2. The mean is not linear, additive errors do not have gaussian distribution
  3. The mean is not linear, additive errors do not have gaussian distribution
  4. Additive errors do not have gaussian distribution and they are not identically distributed
  1. Determine the data transformation to correct the problems in (iii), fit the corresponding regression model, and plot the transformed data with new fitted line.
  1. Log transform:
lreg <- lm(log(transforms$Y1) ~ transforms$X1);
plot(transforms$X1, log(transforms$Y1)); abline(lreg, col="red")

X2_2 <- transforms$X2^2
nlreg <- lm(transforms$Y2~ transforms$X2 + X2_2)
xgrid = data.frame(transforms$X2, transforms$X2^2)
plot(transforms$X2, transforms$Y2, pch=20, col=4)
pred = predict(nlreg, newdata=xgrid)
lines(xgrid$transforms.X2, pred, col=2, lty=2)

X3_3 <- transforms$X3^2
logreg <- lm(transforms$Y3 ~ log(transforms$X3) + X3_3);
plot(log(transforms$X3), log(transforms$Y3))
## Warning in log(transforms$Y3): NaNs produced
xgrid2 = data.frame(transforms$X3, transforms$X3^2)
pred = predict(logreg, newdata=xgrid2)
lines(xgrid2$transforms.X3, pred, col=2, lty=2)

lreg2 <- lm(log(transforms$Y4) ~ log(transforms$X4));
plot(log(transforms$X4), log(transforms$Y4)); abline(lreg2, col="red")

  1. Provide plots to show that your transformations have (mostly) fixed the model violations.
plot(lreg$fitted, rstudent(lreg)); abline(h=0, col=1, lty=2)

plot(nlreg$fitted, rstudent(nlreg)); abline(h=0, col=1, lty=2)

plot(logreg$fitted, rstudent(logreg)); abline(h=0, col=1, lty=2)

plot(lreg2$fitted, rstudent(lreg2)); abline(h=0, col=1, lty=2)

Each of i.–v. for each of four pairs will be assigned 1 point.

Problem 5: Newspapers (20 pts)

Data were collected on the average Sunday and daily (i.e., weekday) circulations (in thousands) for 48 of the top 50 newspapers in the United States for the period March–September, 1993. See newspaper.csv.

  1. (5 pts) Fit a regression line predicting Sunday circulation from daily circulation, then overlay the fitted line and a 95% predictive interval on a scatterplot of the data.
newspaper <- read.csv("~/Documents/Documents/Machine Learning/newspaper.csv")
attach(newspaper)
summ = summary(reg <- lm(Sunday ~ daily))
xf <- data.frame(daily=c(seq(1, 2000)))
pred <- predict(reg, newdata=xf, interval="prediction")
plot(daily, Sunday)
lines(pred[,1])
lines(pred[,2], col="red")
lines(pred[,3], col="red")

  1. (5 pts) Report a \(95\%\) confidence interval for \(\beta_1\) and interpret its meaning in terms of newspaper circulation. Is there a significant relationship between Sunday circulation and daily circulation? Justify your answer by a statistical test. Indicate what hypothesis you are testing and your conclusion.
reg <- lm(Sunday ~ daily)
confint(reg, level= 0.95)
##                 2.5 %     97.5 %
## (Intercept) -4.344248 156.363861
## daily        1.125645   1.427701
zb1 <- (1.27667 - 1)/0.07503
2*pt(-abs(zb1), df=46)
## [1] 0.0005965783

Hypothesis: Daily circulation is a predictor of amount of Sunday circulation. Since the b1 z value is < .05, this means that yes, the daily circulation is a predictor.

  1. (5 pts) Argue that working with the logarithm of the circulation(s) might be better than using the raw numbers. Repeat (a.) with the corresponding log-log regression model. That is, provide the fitted line and 95% predictive interval overlayed on the data.

The logarithm is used because the data is mostly centered at the lower end of the lower end of the x values.

reg1 <- lm(log(Sunday) ~ log(daily))
xf <- data.frame(daily=c(seq(1, 2000)))
pred1 <- predict(reg1, newdata=xf, interval="prediction")
plot(log(daily), log(Sunday))
lines(pred1[,1])
lines(pred1[,2], col="red")
lines(pred1[,3], col="red")

  1. (5 pts) Repeat (b.) with confidence intervals and tests on the elasticity \(\beta_1\). Provide interpretation and clearly state your hypothesis and conclusions.
reg2 <- lm(log(Sunday) ~ log(daily))
confint(reg2, level= 0.95)
##                  2.5 %  97.5 %
## (Intercept) -0.2432072 1.49705
## log(daily)   0.8096909 1.10108
zb1 <- (1.27667 - 1)/0.07503
2*pt(-abs(zb1), df=46)
## [1] 0.0005965783

Hypothesis: Daily circulation is a predictor of amount of Sunday circulation. Since the b1 z value is < .05, this means that yes, the daily circulation is a predictor. Sunday increses by 1% for every .95% change in daily.